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Abstract — Cody & Waite argument reduction technique works 
perfectly for reasonably large arguments but as the input grows 
there are no bit left to approximate the constant with enough 
accuracy. Under mild assumptions, we show that the result 
computed with a fused-multiply-add provides a fully accurate 
result for many possible values of the input with a constant 
almost accurate to the full working precision. We also present 
an algorithm for a fully accurate second reduction step to reach 
double full accuracy (all the signiflcand bits of two numbers 
are significant) even in the worst cases of argument reduction. 
Our work recalls the common algorithms and presents proofs of 
correctness. All the proofs are formally verified using the Coq 
automatic proof checker. 

Index Terms — Argument reduction, fma, formal proof, Coq. 



I. Introduction 

Methods that compute elementary functions on a large 
domain rely on efficient argument reduction techniques. The 
idea is to reduce an argument x to u that falls into a small 
interval to allow efficient approximations [l]-[4]. A commonly 
used argument reduction technique [1], [5]-[7] begins with 
one positive FPN (floating point number) C\ to approximate 
a number C > (usually irrational but not necessarily). 
Examples include C = tt/2 or tt or 2ir for trigonometric 
functions sin a; and cosx, and C — In 2 for exponential 
function e x . 

Let x be a given argument, a FPN. The argument reduction 
starts by extracting \ as defined by 



so that 



x/Ct 



X 



Then it computes a reduced argument x — xCi- The result 
is exactly a FPN as it is defined by an IEEE-754 standard 
remainder operation. But division is a costly operation that is 
avoided as much as possible. Some authors, see for example 
[1], [3], [7], and 
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where k is an integer used to reference a table of size 2 . 
This replacement is computational efficient if u 



zCi 



(1.2) 



is a FPN [8]. 

Sometimes the computed value of u is not sufficiently 
accurate, for example if u is near a multiple of C, the loss 
of accuracy due to the approximation C\ « C may prevail. 
A better approximation to C is necessary to obtain a fully 
accurate reduced argument. If this is the case we use C 2 , 
another FPN, roughly containing the next many bits in the 
signiflcand of C so that the unevaluated C\ + C 2 ~ C much 
better than C\ alone. When equation (11.21 does not introduce 
any rounding error, the new reduced argument is not u but v 
computed by 



zC 2 . 



(1.3) 



To increase once again the accuracy, the error of ( II. 5b need to 
be computed (see Section IVTi. too to obtain v\ and v 2 exactly 
satisfying 



V\ + v 2 



zCo. 



(1.4) 



The last step creates a combined reduced argument stored in 
the unevaluated sum v\ + w with 2p significant bits 



w 



v 2 - zC 3 



(1.5) 



Whether v\ (or v\ and w) is accurate enough for computing 
the elementary function in question is subject to further error 
analysis on a function-by-function basis [9]. But this is out of 
the scope of this paper. 

|http: //www. Intel . com/software/products/opensource/l%&fe^Bffetfl chni q ue t 5 ] is presented in Figures [TJ 

ana |z] wnere 0(0) denotes the FPN obtained from rounding 
a in the round-to-nearest mode. Those are examples when no 
fma is used. The sizes of the rectangles represent the precision 
(length of the signiflcand) of each FPN and their positions 
indicate magnitude, except for z and C\ whose respective 
layouts are only for showing lengths of significands. The light 
gray represents the cancellations: the zero bits due to the fact 
that \x — o(z x C\)\ <C \x\. The dark grey represents the 
round-off error: the bits that may be wrong due to previous 
rounding(s). 



introduce another FPN R that approximates 1/C and the 
argument reduction replaces the division by a multiplication 
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Figure Q] presents the ideal behavior. Figure [2] presents the 
behavior when the significand of z is longer. Then, fewer bits 
are available to store the significand of C\ in order for zC\ 
to be stored exactly. The consequence is a tremendous loss 
of precision in the final result: as C\ must be stored in fewer 
bits, the cancellation in the computation of x — zC\ is smaller 
and the final result may be inaccurate. 

We want to take advantage of the fused-multiply-add (fma) 
instructions. Some machines have hardware support for it, such 
as machines with HP/Intel® Itanium® Microprocessors [1] 
and IBM PowerPC Microprocessors, and this instruction will 
also be added to the revision of the IEEE-754 standard. The 
current draft can be found at 

|http : / /www . validlab . com/7 54R/| 

It is obvious that some bits of x and zC\ will cancel each other 
as z is computed such that x ss zC\, but it is not clear how 
many of them will and under what condition(s). Consequently 
if accuracy calls for x — zC\ to be calculated exactly (or to 
more than p bits in the significand), how do we get these bits 
efficiently? This question is especially critical if the working 
precision is the highest available on the underlying computing 
platform. 

In this paper, we will devise easily met conditions so that 
x — zC\ can be represented exactly by a FPN, and thus it can 
be computed by one instruction of the fma type without error. 
This technique is presented in Figure [3] The understanding is 
the same as in Figures Q] and [2] The cancellation is greater as 
C\ can be more precise. The idea is that the rounding in zC\ 
is avoided thanks to the fma: zC\, a 2p-bit FPN, is virtually 
computed with full precision and then subtracted from x. This 
subtraction is proved to be exact as x « zC\. The fact of 
x — zC\ being a FPN is used by the library of [1], [7] with 
no formal justification until [8]. 

The motivations of this work are similar to those presented 
in [8] and Section [TT] recalls briefly some useful prior-art 
from the authors [8], [10]. However, the rest of the paper 
presents entirely new results. The theorems and their proofs 
are different from the ones presented in [8]. The changes are 
necessary to facilitate verification with an automatic proof 
checker. Moreover, the results have been improved, and are 
simpler to grasp and new results have been added thanks to 
this simplification and to a better understanding of the FPNs 
relationships due to the formal proof. 

In a floating-point pen-and-paper proof, it is difficult to be 
absolutely sure that no special case is forgotten, no inequality 
is erroneous, and no implicit hypothesis is assumed, etc. 
All the proofs presented in this paper are verified using our 
specification of generic floating point arithmetic [10] and 
Coq proof assistant [11]. This approach has already been 
proven successful in hardware or software applications [12]- 
[15]. The drawback is a long and tiresome argumentation 
versus the proof checker that will ascertain each step of 
the demonstration. The corresponding scripts of proofs are 
available online at 

|http : / /www . net lib ■ org/f p/ fp2 . tgz| 

We indicate for each theorem its Coq name. The developments 



presented here are located in the FArgReduct [2,3,4] .v 
files. 

The rest of this paper is organized as follows. Section [TT] 
recalls theorems on the number of cancelled bits of two close 
FPNs (extensions of Sterbenz's theorem [16]). In Section ITTTl 
we present the Coq verified theorem about the correctness of 
the algorithm that produces z in dl.lt and that satisfies the 
conditions of the following theorems. The demonstration of 
the main result, i.e. the correctness of the first reduction step, 
is then described in Section |IV] In Section [V] we give new 
algorithms and results about a very accurate second step for 
the argument reduction. Section [VTJ concludes the work of this 
paper. 

Notation. Throughout, denotes the floating point subtrac- 
tion. {A'jfma denotes the result by an instruction of the fused- 
multiply-add type, i.e., the exact ±a ± b x c after only one 
rounding. FPNs use p digits, hidden bit (if any) counted, in the 
significand or otherwise explicitly stated. We denote o(a) the 
FPN obtained from rounding a in the round-to-nearest mode 
with p digits and o m (a) if we round to m digits instead of p. 
We denote by ulp(-) the unit in the last place of a p-digit FPN 
and ulp o2 (-) = ulp(ulp(-)). The smallest (subnormal) positive 
FPN is denoted by A. 

II. Exact Subtraction Theorems 

These theorems will be used in Section HVl to guarantee that 
there will be enough cancellation in x — zC\ so that it can be 
computed exactly by one fma type instruction, or equivalently, 
to assure x — zC\ fits into one FPN. 

A well-known property [16], [17] of the floating point 
subtraction is the following. 

Theorem 1 (Sterbenz in Fprop.vj: Let x and y 
be FPNs. If y/2 <x<2y, then x - y is a FPN. This 
is valid with any integer radix (3 > 2 and any precision 
P > 2. 

We extend Sterbenz's theorem to fit the use of a fused- 
multiply-add that may create a higher precision virtual number 
whose leading digits are canceled to the working precision 
as explained in Figure [4] when x and y are sufficiently near 
one another, cancellation makes the result exactly fit a smaller 
precision. 

Theorem 2 (SterbenzApprox2): Let x and y be p\- 
digit FPNs. If 

- < x < (1 + I3 P2 ~ P1 ) y, 

then x — y is a p2-digit FPN. This is valid with any 
different significand sizes pi,p2 > 2, and any integer 
radix f3 > 2. 

The proofs are omitted as they appeared in other publica- 
tions [8], [18]. It is worth mentioning that Theorem[2]do not 
require p\ > p2 or p2 > p\ ■ 

From now on, all FPNs are binary. The underlying machine 
hardware conforms to IEEE-754 floating point standards [19], 
[20]. This implies that rounding does not introduce a rounding 



z x C\ is exact 
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cancellation x — 2 x C\ is exact 



I °(z x C 2 ) 

o(x — s x Ci — o(z x C2)) I round-off error 



x — z x C\ is exact 

0(Z X C 2 ) 
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Fig. 1. Reduction technique works for z sufficiently small. 



final precision 

Fig. 2. Cody Waite technique fails as z grows, i.e. u is not accurate 
enough. 
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cancellation x — z X C\ is exact 



z X C 2 



o(x — z X C\ - z X C'z) ~] round-off error 



final precision 

Fig. 3. Argument reduction with exact cancellation in a fused-multiply-add. 



Theorem 3 (arg.reduct-existS-k-zHj: Assume 

• V > 3, 

> x is a p-bit FPN, 

• R is a positive normal p-bit FPN, 

. z = {3 • 2P- N - 2 + xR} ima e 3 • 2P- N -\ 
. |*| > 2 1 -*, 

. 2- w is a FPN. 
Then there exists an integer I satisfying 2 < I < p — 2 
such that 

« \z2 N \ is an ^-bit integer greater than 2 t ~ 1 , and 
. \xR-z\ < 2- N -\ 



error when the exact result is a FPN. Unless explicitly stated, 
the default rounding mode is round-to-nearest with ties broken 
to the even significand. 



III. About the Algorithm for z 

The computation of z can be done efficiently as 

z = {xR + (j} fma - a, (III.l) 

where a is a pre-chosen constant. The technique is adapted 
from [1, Chap. 10] who used an idea attributed by the author 
to C. Roothaan in his work for HP's vector math library for 
Itanium. The explanation is in Figure [5] here we choose a = 
3 • 2 P ~ N ~ 2 for a z having its last bit at exponent — N. 

In realizing dl- lb - the wanted results are that z2 N is an 
integer, and that \xR — z\ < 2~ N ~ 1 . We may also need that 
the precision needed for z is smaller or equal to p — 2. Here 
is the theorem, verified by Coq. 



pi digits 



z - y \ 

P2 digits 

Fig. 4. Extension of Sterbenz's theorem. 



In short, if z is computed as explained and x is not too 
big, then z is a correct answer, meaning it fulfills all the 
requirements that will be needed in Theorem [4] in the next 
section. 

—N 

xB 
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{3 ■ 2P-N-2 + xfl} fma IlltlOODIlO; 
z={3- &- N - 2 + xfl.} )ma 6 3 ■ 2P-X-'' ^^^^ 

Fig. 5. Algorithm for computing z. 

For Intel's double extended precision, this technique is 
perfectly adapted for range reduction with argument between 
— 2 63 and 2 63 when R is in the order of 0(1). This argument 
range coincides with what is in Intel's manual [21] for FSIN, 



FCOS, FPTAN and FSINCOS. A quick justification is for 
C = 2ir and modest N, say N = for an example, 
\xR\ < 2 63 /(2tt) gives \xR\ < 2 62 - 1. 

For the exponential function, any argument larger than 
11356 overflows in the double extended and quad precisions, 
and t < p — 2 is easily satisfied. 



IV. Main Results 

We now present the conditions under which x — zC\ can 
be represented exactly by a FPN, and thus it can be computed 
by {x — zCijfma without error. As in SectionH] R w 1/C and 
Ci ps C. We suppose that C > and C 7^ 2 J for any j. 

The idea is the use of a fused-multiply-add that may create 
a higher precision virtual number that cancels to the working 
precision. Figure [6] explains the idea: if z is an £-bit integer 
and the significand of C\ uses p — q bits, it takes up to p — q+£ 
bits to store the significand of zC\. And as zC\ and x are near 
enough, the final result fits into p bits. The notation rax stands 
for the significand of X and ex its exponent. 

p — q bits 




p — q + £ bits 



p bits 



Rounding (or not) 



p bits 



Fig. 6. Fused-multiply-add used to create and cancel a higher precision 
virtual number. 



We want to give enough hypotheses on the inputs to 
guarantee that x — zC\ will be computed without error. 

We define the exponent sr of R as the only integer such 
that 2 eR < R < 2 en+1 . We want to set the q least significant 
bits of G\ to zero. Since C\ should be as accurate as possible, 
we set C\ ps 1/R to the nearest FPN with p — q significant 
bits. From this, we deduce that 2" eR_1 < C\ < 2~ en and 
that the distance between C\ and 1/R is less than half an ulp 
(in p — q precision) therefore 



< 2 



-en-l-(p-g) 



We now define S = RC\ — 1, and we deduce a bound on 
its magnitude from the previous inequalities 

\5\ < 2 q - p . 

Let z be as defined by (II.lt with the conditions on z and s 
given there. We assume for the moment that z ^ 0. Theorem[2] 
can be used if we bound xj{zC\) and its reciprocal by 1 + 



2 q . We have the following equalities: 

x xR 

~z~C[ = zRCi 
z + s 



zRd 



= 1 



z) 1 + 5' 

We recall that z = k2~ N and that k is an integer using I bits, 
and we deduce on the other hand 



-N+e-i 



; \z\ < 2 



-N+l 



to bound 



< 2" 



Rewriting the condition of Theorem [2] and taking advantage 
of preceding results, we arrive at the point to prove both 

1 1 o-e 



1 + 1*1 
1 + 1*1 



< l + 2 q 



< l + 2 q ~ 



-£ 



(IV. 1) 



(IV.2) 



Conditions ( II V. U and ( IIV.2b are checked using functional 
analysis on polynomials and homographic functions for any 
permitted value of A = 2~ e . Since z is both a machine number 
and a non-zero £-bit FPN (1 < £ < p). From Section UTTl the 
algorithm used to produce z implies I < p — 2. We will use a 
more generic condition: 

2 1 ~ p < A = 2- e < -. 

~ 2 

We will now explain what are the successive requirements to 
guarantee that both (II V. Il l and (IIV.2b are fulfilled. 

a) Condition MV.lj : We want to guarantee that 

1 + 2- £ 

j — 29-^ — 1 + 1*1- The homographic function 



1 



1 + A 



1 + 2<i- e 1 + A2i 

we want to bound is maximized at A = 2 1_p and it is sufficient 
to check if (1 + 2 1 ~p)/(1 + 2 1 ~P2 q ) < 1 + \S\. We use the 
bound on \S\ and we introduce B = 2 q . We have left to prove 
that 

(1 + 2 1 - p )/(l + 2 1 - p B) < 1 - B2- p . 

This is equivalent to check if the second order polynomial 
2 1 ~ P B 2 -B + 2 < 0. The inequality is satisfied for B between 
the two roots 2 P ~ 2 (l ± \fl - 2 4 ~p^J . Thus it is sufficient to 
have B > 4 for all precisions. 

b) Condition (\IV.2l : We want to guarantee that 1 + \ 5\ < 
(1 + 2<?- £ )(l - 2-'). We introduce A and B as before, so we 
have left to prove 

1 + 1*1 < (1 + AB)(1-A). 

We assume that B > 4 from the preceding paragraph. The 
polynomial 

(1 + AB)(1 - A) = (1 + 2 q - e )(l - 2- e ) 



is minimized at A = 2 1 p and it is sufficient to check if 
(1 + \5\) < (1 + 2 1 -PB)(l - 2 1 ~p). From the bound on \S\, 
we now have to check if 

(1 + B2- p ) < (1 + 2 1 ~ p B)(l - 2 1 - p ) 

which is true for any precision. 

This proof is rather long and complex. We therefore verified 
it in Coq to be sure there is no mistake. It also gives us more 
precise and sharp hypothesis than if we would do that only 
by pen-and-paper. All hypotheses have to be clearly written 
so that the proof can be checked. There is no easy way to say 
"we assume there is no Underflow" or "that the precision is 
big enough". This leads to long theorems (at least longer than 
what we are used to), but precise and correct ones: 

Theorem 4 (Fmac_arg_reduct_correctl): 
Assume 

• V > 3, 

« x is a p-bit FPN, 

• R is a positive normal p-bit FPN, 
. 2 <q<p- 1, 

• C\ is the (p — g)-bit FPN obtained by rounding 
1 /R to p — q bits using round-to-nearest mode, 

• C\ is not exactly a power of 2, 

. Ci > 2 p ~ q+inax ( 1 > N ~ 1 * > X, 

. 2<£<p-l, 

• \z2 N \ is an ^-bit integer greater than 2 £_1 , 
. \xR-z\ < 2~ N ~\ 

. q<l 
Then x — zC\ is a p-bit FPN. 

In short, if C\ is rounded to the nearest from 1/ R with p— q 
bits and q > 2 and z is not too small, then the f ma does not 
make any round-off error. 

Automatic proof checking also prompted us that the exact 
behavior may be difficult to obtain for z = 2~ N and x close 
to 2~ N_1 R. This case was excluded in Theorem 0] under the 
hypothesis that 2 < I, but it will be included in the next 
theorem which focuses on q = 2 as this situation leads to C\ 
as close as possible from C and thus has more practical value. 
For completeness and theoretical interest a theorem similar to 
Theorem H] but valid for all 2 < 2 < p — 1 is presented in the 
appendix. 

Assume q = 2 in the rest of this section. When z = 2~ N , 
then x < 2C\ x 2 as xR is approximated by z = 2~ N . We 
can also deduce that 

Ci x 2-" 
1 + 2 2 -p ~ X - 

When C\ x 2~ N /2 < x, Sterbenz's theorem (Theorem[TJ can 
be applied and x — C\ x 2~ N is representable. If not, then 

d x 2"^ Ci x 2~ N 

1 + 22-p ~ X< 2 ■ 

Since C\ is a (p — 2)-bit FPN and not exactly a power of 2 as 
a p-bit FPN, then C\ is at least 4 ulps away from a power of 2. 
This is because as a p-bit FPN, C\ is worth 2 e x l.bb • ■ ■ bOO, 
where at least one of the b's must be 1; therefore the C\ that 
comes closest to a power of 2 is either 2 e x 1.0 ■• ■ 0100 or 



2 e x 1.11 ■ ■ • 100. Both are 4 ulps away from a power of 2. This 
distance and the preceding inequality are enough to guarantee 
that the exponent of x is the exponent of C\ minus N + 1. 
After a few computations, we finish with x — C\ x2~ N being 
a FPN, regardless of x. 

A few peculiar cases have been omitted in the sketch of 
this proof. Automatic proof checking allows us to trustfully 
guarantee that these cases have been all checked in our pub- 
licly available proof scripts. The only surprising condition is 
presented in this section. The other cases are easily generalized 
from Theorems [3] and [4] So just by wrapping these two results 
together, we can state the following theorem in its full length, 
verified with Coq. 

Theorem 5 fFmac.arg.reduct.correct3): 
Assume 

• p > 3, 

• x is a p-bit FPN, 

• R is a positive normal p-bit FPN, 

« Ci is the (p — 2) -bit FPN obtained by rounding 
1/R to p — 2 bits using round-to-nearest mode, 

• C\ is not exactly a power of 2, 
. Ci > 2P+ niax (- 1 > JV )A, 

. z = {3 • 2P- N - 2 + xR} fma e 3 • 2P- N - 2 , 

. \xR\ < 2P~ N ~ 2 - 2"^, 
. 2- N is a FPN. 
Then x — zC\ is a p-bit FPN. 

In short, if C\ is rounded to the nearest from 1/R with p — 2 
bits and z is computed as usual, then the fma does not make 
any round-off error. In Tables HI and HH we present constants R 
and C\ for 7r and ln(2). These constants are for the exponential 
and the fast reduction phase of the trigonometric functions [1], 
[3], [9], [22]. 

The hypotheses may seem numerous and restrictive but they 
are not. As R and C\ are pre-computed, the corresponding 
requirements can be checked beforehand. Moreover, those 
requirements are weak: for example with 0<=A r <=10in 
double precision, we need C\ > 2 -1011 w 4.510" 305 . There 
is no known elementary function for which C\ ever comes 
near a power of 2. The only nontrivial requirement left is the 
bound on 

V. Getting More Accurate Reduced Arguments 

As we pointed out in the introduction in Section [I] some- 
times the reduced argument u = x — zC\ is not accurate 
enough due to the limited precision in C\ as an approximation 
to C. When this happen another FPN C2 containing the lower 
bits of the constant C has to be made available and the new 
reduced argument is now x — zC\ — zC-2- Assume that the 
conditions of Theorem [5] hold. In particular C\ has p — 2 bits 
in its significand. 

The number x — zC\ — zC^ can be computed exactly [23] 
as the sum of two floats. But here because we know certain 
conditions on z, C\, and Ci as FPNs, we can do it faster. 
Inspired by [23], we propose the following Algorithm 15. II to 
accomplish the task. It is built upon two known algorithms: 



Example of value for R = 



TABLE I 

o(l/C), Ci ROUNDED TOp - 2 BITS, C*2 OBTAINED FROM ALGORITHM ^. 21 AND C3, FOR C 
EASILY LEADING TO C = 2n OR C = tt/2 



— 7T, 



Precision 


Single 




Double 




Double extended 




Quad 




R 


10680707-2" 


25 


5734161139222659-2" 


-54 


11743562013128004906-2" 


-65 


6611037688290699343682997282138730-2" 


ill 


Ci 


13176796-2" 


22 


7074237752028440 ■ 2" 


-51 


14488038916154245684-2" 


-62 


8156040833015188200833743081374136-2" 


ill 


c 2 


-11464520-2 


-45 


4967757600021504-2" 


105 


14179128828124470480-2" 


126 


9351661544631751449372323967920768-2" 


226 


r. 


-15186280-2 


-67 


7744522442262976 ■ 2 " 


155 


10700877088903390780 -2" 


189 


-9186378203702558149401308890796140-2 


-334 



TABLE II 

Example of value for R. = o(l/C), Ci rounded Top - 2 bits, C2 obtained from Algorithm ^. 21 and C3, for C = ln(2) 



Precision 


Single 




Double 




Double extended 




Quad 




R 


12102203-2" 


-23 


6497320848556798-2" 


-52 


13306513097844322492-2" 


-03 


7490900928631539394323262730195514-2" 


112 


Ci 


11629080-2" 


-24 


6243314768165360-2" 


-53 


12786308645202655660-2 


-64 


7198051856247353947080814903691240-2" 


113 


c 2 


-8577792-2 


-52 


-7125764960002032-2 


-106 


-15596301547560248640-2 


-130 


-5381235925004637553074520129202340-2 


-224 


c 3 


-8803384-2 


-72 


-7338834209110452-2 


-161 


-13766585803531045332-2 


-192 


-9437982846677142208552339635087788-2 


-338 



• Fast2Mult(.T,y) that computes the rounded product of x 
and y and its error (2 flops) [24]. 

• Fast2Sum(x,y) that computes the rounded sum of x and 
y and its error (3 flops), under the assumption that either 
x = 0, or y = 0, or | x > \y\, or there exist integers 
n x ,e x ,n y ,e y such that x = n x 2 £l and y = n y 2 Cy and 
e x > e y [10]. 

Algorithm 5.1 (Super accurate argument reduction): 
^\ The correctness of this algorithm is only guaranteed 
JL under the conditions of Theorem [6] It does not 
work with any C\ , C 2 ! 

u = o(x — zCi), 

vi = o(u - zC 2 ), 

(pi,P2) = Fast2Mult(z,C 2 ), 

(iit^a) = Fast2Sum(w, — pi), 

V 2 = °(°(°(tl-Vi)+t 2 )-p2)- 



The first requirements are very similar to the previous ones. 
The "no underflow" bound on C\ has been raised, but is still 
easily achieved by real constants. For a typical N between 
and 10 used by the existing elementary math libraries in IEEE 
double precision, it suffices that C > 10 -288 . 

The most important add-ons are the requirements on C 2 . 
it must be much smaller than C\ (it is near the difference 
between the constant C and Ci). And C2 must not be "too 
precise". In fact, C\ + C2 will have 2p — 4 bits as shown in 
Figure [7] If by chance, there are a lot of zeroes just after C\, 
we cannot take advantage of that to get a more precise C 2 - 
This is a real drawback, but it does not happen very often that 
many zeroes are just at the inconvenient place. 



Ci Ci 








1 optimal Ci 




optimal C2 









possible 0(s) 8ulp o2 (Ci) 



Fig. 7. Respective layouts of our Ci and C2 compared to optimal values. 

This algorithm may seem simple but it is a very powerful 
tool. It is exact and it is very fast: the generic algorithm [23] 
costs 20 flops while this one costs only 9 flops! More, the 
result is more usable than expected as it fits in only one float 
instead of two in the general case. 

As for the computation of C 2 , the requirements are rather 
low: there are several C 2 fulfilling them. It may be useful 
to choose one of them in order to have the bigger or the 
smaller C 2 possible. Algorithm [52] gives one way to compute 
a convenient C 2 . 

The idea of the proof for Theorem [6] is a careful study of 
the possible exponents for the involved FPNs. We first prove 
that x is an integer multiple of 2~ Ar ulp(Ci). This is done 
for whether z is 2~ N or not to guarantee the correctness of 
Fast2Sum. 

We then prove that t\ — v\ fits in a FPN. This proof is 
obtained by noticing that t\ and v\ are integer multiples of 



Theorem 6 (FArgReduct4 . v file): Assume 

• p > 4, 

• x is a p-bit FPN, 

• R is a positive normal p-bit FPN, 

« C\ is the (p — 2) -bit FPN obtained by rounding 
1 /R to p — 2 bits using round-to-nearest mode, 

• Ci it is not exactly a power of 2, 

. z = {3 ■ 2P~ N ~ 2 + xi?} fma G 3 ■ 2P~ N ~ 2 , 

. |a;.R| < 2P~ N ~ 2 - 2~ N , 

• 2~ N is a normal p-bit FPN, 

• C\ > 2 p+niax< -~ 1 ' p+N A, 

• C 2 is a FPN and an integer multiple of 8ulp o2 (Ci), 
. \C 2 \ < 4ulp(Ci), 

• vi and v 2 are computed using Algorithm 15. II 
Then Fast2Sum works correctly and we have the math- 
ematical equality v ! + v 2 = x — zC\ — zC 2 (all the 
computations of the last line indeed commit no rounding 
errors). 



2- A, - 1 ulp ^(Ci) and that that \h - Vl \ < 2P- A '- 1 ulp 0/ (C*i). 

The next step is about t\ — V\ + t<i = u — pi — Vi being a 
FPN. We do it similarly as all these quantities are also integer 
multiples of 2- Ar - 1 ulp o2 (Ci) and as we easily have that \t± — 
V!+t 2 \ < 2P- N - 1 ulp o2 (C 1 ). 

We finally prove that t\ —V\ +t.2—p2 = u — zC<i —v\ fits in 
a FPN. Its least significant non-zero bit is at most shifted N 
times down compared to the least significant non-zero bit of 
C2. For this reason, we require that C2 is an integer multiple 
of 8ulp o2 (Ci). 

This proof needs a careful study of the relationships between 
the various floats and their exponent values. The formal 
proof and its genericity allowed us a better understanding 
of the respective layouts of the FPNs, that is the key of the 
correctness of Algorithm 15.11 

Algorithm 5.2 (Computation of C%): Let C be the exact 
constant (for example, 7r or In 2). 

R = ° p (l/C), 
Ci = o P - 2 (l/-R), 

and take C2 to be the first many significand bits of C — C\ 
so that its least non-zero bit must be greater than or equal to 
log 2 (ulp(Ci)) - p + 4 = log 2 (gulped)), e.g., 

(C-d) 



C 2 = 



8ulp o2 (C0 

where [ J is one of the round-to-integer operations. 

This C2 has all the expected properties except that we do 
not know for sure if IC2 < 4ulp(Ci). Note that C\ is not 
gotten by directly rounding C but rather Ci = o (l/ o (^7)). 



8ulp o2 (C : 



Theorem 7 fganma2_lej: Assume 

• P > 3, 

« C is a real positive constant, 

• R is the p-bit FPN obtained by rounding 1/C to p 
bits using round-to-nearest mode, 

• R is a positive normal p-bit FPN, 

• C\ is the (p — 2)-bit FPN obtained by rounding 
1/R to p — 2 bits using round-to-nearest mode, 

• C\ is not exactly a power of 2, 
. Ci > 2 p - 1 A. 

Then \C-C X \ < 4ulp(Ci). 



As C is not too far from Ci, we have that C < 2 p+1 ulp(Ci). 
We now bound C — C\. 

\C-d\ < \C-l/R\ + \l/R-Cx\ < %\R-1/C\ + \1/R- 
d|, so \C - Ci| < §ulp(i?)/2 + 4ulp(Ci)/2 < C2-P- 1 + 
2ulp(Ci), hence the result. 

This means that the formula for C2 given above yields a 
FPN fulfilling the requirements of Theorem [6] 

VI. Conclusions 

We have presented Coq verified theorems that prove the 
correctness and effectiveness of a much faster technique based 
on the commonly used argument reduction in elementary func- 
tion computations, on machines that have hardware support 



for fused-multiply-add instructions. The conditions of these 
theorems are easily met as our analysis indicates. While we 
have showed it is not always possible to use the most accurate 
parameters under all circumstances, an almost best possible 
selection can be used at all times: to zero out the last 2 bits. 

We have presented also a very accurate second step argu- 
ment reduction. We provide a way to compute Ci which is 
not the most precise possible, but is usually 2 bits away from 
it (and can be rounded as needed by the programmer). The 
most interesting part is the possibility to compute with FPNs 
the exact error of the second step of the argument reduction 
and the fact that this error is exactly representable by only one 
FPN. It makes the third step unexpectedly easy as we have a 
mathematical equality between the computed FPNs and a very 
good approximation of x — zC (with a known error). 

Except for the computation of C2, all the rounding used 
should be rounding to nearest, ties to even. But our proofs are 
generic enough to show that our results still hold when using 
rounding to nearest, where cases of ties can be decided in 
any coherent way [25]. This includes rounding to nearest, ties 
away from zero that is found in the revision of the IEEE-754 
standard. 

The formal verification forces us to provide many tedious 
details in the proofs but gives us a guarantee on our results. 
The proposed theorems are sufficient in the sense that effective 
parameters for efficient argument reductions can be obtained 
without any difficulty. 

Our theorems provides us with sufficient conditions for x — 
zC\ to be a FPN. This means that x — zC\ could be a FPN 
even when one or more of the conditions fails for some specific 
values of C, C\ and R as published in the past [1], [7]. We 
may work on this in the future even though there is only a 
limited space for improvement as only the last two bits of C\ 
can be changed to make the constant more accurate. 

The algorithms proved can be applied to any floating- 
point format (IEEE single, double or extended for example). 
Intuitively, the correctness of these algorithms should come 
as natural. Nevertheless, rigorous proofs are not trivial due to 
a few special cases that could have been easily dismissed by 
hand- waving proofs. 
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Appendix 

Theorem [4] can be used for any value of 2 < q < p — 1. 
In most case, users are interested for the smallest possible 
value of q because that will give a more accurate C\ and 
consequently a more accurate reduced argument. For this 



reason, we proved Theorem [5] for q = 2. The following 
theorem is under the hypothesis that 

RCi < 1, 

while 2 < q < p — 1 still. This add-on is enough to guarantee 
cases that are left over by Theorem [4] 



We essentially need to consider how to make R and C\ 
satisfy this new constraint. Since there is no strict connection 
between R, C\ on one hand and C on the other hand, we can 
either use R to be the correctly rounded FPN nearest 1/C7 or 
we may alternatively add or subtract one or a few ulps so that 
the additional inequality is met. 



Theorem 8 (Fmac_arg_reduct_correct2): 
Assume 

• P > 3, 

. 2<q<p-l, 

m x is p-bit FPN, 

• R is a positive normal p-bit FPN, 

• Ci is the (p — g)-bit FPN obtained by rounding 
1 /R to p — q bits using round-to-nearest mode, 

• C\ is not exactly a power of 2, 

• C\ > 2P~ I ? +max ( 1 < 7V - 1 )A, 

. z = {3 • 2P~ N ~ 2 + xR} ima G 3 • 2P- N - 2 , 

. 2~ N is a FPN, 
. \ X R\ < 2P~ N - 2 - 2~ N , 
. RC X < 1. 
Then x — zC\ is a p-bit FPN. 



